1. INTRODUCTION

Dengue fever is a dangerous infectious disease whose case numbers have steadily increased over the years. It
is caused by a virus carried in the saliva of the Aedes mosquito and transmitted to humans through its bite, and
the resulting illness ranges from mild to severe [1], [2]. According to the Epidemiological Data and Surveillance
Center, Ministry of Health, Indonesia, dengue fever remains a crucial problem in Indonesia because the number of
infections and the area of distribution are increasing along with population mobility and density.

Based on data from the Ministry of Health of the Republic of Indonesia, the national case fatality rate (CFR) of
dengue fever in 2019 was 0.67%. The CFR is the proportion of deaths among all reported cases. A province is said
to have a high CFR if the value exceeds 1%. One of the provinces with a high CFR is East Java, at 1.01%. According
to the Malang Regency Health Office, Malang Regency had the highest number of dengue fever cases and deaths in
East Java in 2019. Efforts are therefore needed to control the death rate from dengue fever in Malang.

One way to control the mortality rate in Malang Regency is to predict the number of future dengue fever cases.
A forecasting model allows the parties in charge to take steps and arrange policies that minimize the increase in
cases and mortality rates. Several dengue fever forecasts have been carried out using weekly or monthly numbers
of cases [3]-[5]. Previous research also reports a fairly high correlation between the number of dengue fever
cases and rainfall, temperature [6], and humidity [7].

Penalized regression is a regression model that uses a penalty term to reduce overfitting in multiple linear
regression [8]. In this study, Ridge, Lasso, Elastic Net, smoothly clipped absolute deviation (SCAD), and minimax
concave penalty (MCP) were explored. To overcome the limitations of a single forecasting model, ensemble methods
are used: they can improve on the accuracy of the base models and identify complex objects and uncertainties
[9]-[12].


2. METHODOLOGY 

Based on Figure 1, there are four main steps to build the ensemble model with penalized regression. First,
raw data are gathered from various sources. They consist of climate data (temperature, humidity, wind speed,
rainfall) and the number of dengue cases. Data cleaning is then carried out to produce processed data that are
ready to be used for model development.

Figure 1. Model architecture


After the raw data collection and cleaning stage, the next step is to test the correlation between variables.
The data are then split into two parts, training data and testing data. The training data are used to train the
models and the testing data are used to measure model performance. These data are also used to determine the
parameter of each penalized regression.
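The correlation test mentioned above can be illustrated with a minimal sketch. The source does not state which correlation coefficient was used, so the Pearson coefficient is an assumption here, and the `rainfall` and `cases` values are hypothetical numbers for illustration only.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly values, for illustration only (not the study's data).
rainfall = [10.0, 35.0, 60.0, 20.0, 5.0, 80.0]
cases = [4, 9, 15, 7, 3, 18]
r = pearson_correlation(rainfall, cases)
print(f"correlation between rainfall and cases: {r:.3f}")
```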

In building the ensemble forecasting model, five penalized regressions (Ridge, Lasso, Elastic Net, SCAD, and MCP)
are each trained and validated. Ridge regression is widely used for high-dimensional data where the independent
variables are highly correlated, and it aims to reduce multicollinearity [13]. Lasso uses regularization and
variable selection to increase interpretability and accuracy [14]. Elastic Net is a combination of Ridge and
Lasso regression, so it retains the advantages of both methods [15]. SCAD regression aims to improve on the Lasso
penalty by reducing the bias in the model, because the Lasso penalty grows linearly with the size of the
regression coefficient [16]. MCP is another alternative that gives less biased estimates in sparse models [17].
The parameter of each penalized regression is then determined.
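To make the difference between the five penalties concrete, the sketch below evaluates each penalty for a single coefficient. It is an illustration under stated assumptions rather than the study's implementation: the Ridge and Elastic Net scaling follows the glmnet-style convention, and the shape parameters `a = 3.7` (SCAD) and `gamma = 3.0` (MCP) are conventional defaults that the source does not report.

```python
def ridge_penalty(beta, lam):
    """Ridge: quadratic penalty (glmnet-style scaling, assumed here)."""
    return 0.5 * lam * beta ** 2


def lasso_penalty(beta, lam):
    """Lasso: penalty grows linearly with |beta|."""
    return lam * abs(beta)


def elastic_net_penalty(beta, lam, alpha=0.5):
    """Elastic Net: mix of Lasso (alpha = 1) and Ridge (alpha = 0)."""
    return lam * (alpha * abs(beta) + 0.5 * (1.0 - alpha) * beta ** 2)


def scad_penalty(beta, lam, a=3.7):
    """SCAD: Lasso-like near zero, then flattens to a constant."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2.0 * a * lam * b - b ** 2 - lam ** 2) / (2.0 * (a - 1.0))
    return lam ** 2 * (a + 1.0) / 2.0


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP: penalty rate decreases steadily until it becomes constant."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b ** 2 / (2.0 * gamma)
    return 0.5 * gamma * lam ** 2


# Large coefficients keep being penalized by Lasso, but not by SCAD or MCP.
for beta in (0.5, 2.0, 10.0):
    print(beta, lasso_penalty(beta, 1.0), scad_penalty(beta, 1.0), mcp_penalty(beta, 1.0))
```

Because the SCAD and MCP penalties stop growing for large coefficients, large effects are shrunk less than under Lasso, which is the bias reduction described above.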

After each model has been evaluated, an aggregated prediction is formed by averaging the predictions of the
models. In general, these steps are carried out on data ordered sequentially along the time dimension. Once the
ensemble forecasting equation has been formed, the dependent variable (weekly number of dengue fever cases) is
forecast on the test data using the ensemble model. After the forecasting models have been formed and the
predictions made, the analysis proceeds by predicting the magnitude of dengue fever incidence and analyzing
strategies. Finally, the model performance is tested.
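The averaging step can be sketched as follows. The function name `average_ensemble` and the prediction values are hypothetical and serve only to show how the aggregated prediction is formed week by week.

```python
def average_ensemble(predictions_by_model):
    """Average the predictions of several models, time step by time step."""
    series = list(predictions_by_model.values())
    if not series:
        raise ValueError("at least one model is required")
    n_steps = len(series[0])
    if any(len(s) != n_steps for s in series):
        raise ValueError("all models must predict the same number of steps")
    return [sum(step) / len(series) for step in zip(*series)]


# Hypothetical weekly predictions from two base models (illustration only).
predictions = {
    "SCAD": [12.0, 15.0, 9.0],
    "Elastic Net": [10.0, 13.0, 11.0],
}
print(average_ensemble(predictions))
```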

This study tests two forms of data to obtain the best model results: the untransformed (normal) data and data
transformed with the natural logarithm (ln). The natural logarithm transformation is used to stabilize the
variance when performing standard regression procedures. Besides correcting unstable (non-constant) variance,
the transformation can also be used to correct for non-linearity and for residuals that are not normally
distributed (non-normality) [18].


3. RESULTS AND DISCUSSION 
The experiments were performed on an Intel® Core™ i5-7200U central processing unit (CPU) @ 2.50 GHz 2.70 GHz
with 8 GB (gigabytes) of random-access memory (RAM), running Windows 10 Home Single Language (64-bit). The
software tool used is RStudio, with R as the programming language. The following subsections present the results
of each research step: splitting the data, determining the parameters, building the penalized regression models,
building the ensemble model, and comparing the model's performance with other related methods.


3.1. Splitting training data and testing data

To train the model, the data are divided into two parts: training data and testing data. Training data are the
part of the dataset used to fit the model so that it can make predictions. In essence, the algorithm is given
examples from which the trained model learns the relationship between the variables on its own. Testing data are
the part of the dataset used to assess the accuracy of the model, in other words, its performance. The overall
data from 2014 to 2018 are split sequentially, with 70% used for training and 30% for testing.
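A sequential 70:30 split keeps the time order intact, so the model is always tested on weeks that come after the training period. A minimal sketch is given below; the function name `chronological_split` and the placeholder row indices are assumptions for illustration.

```python
def chronological_split(rows, train_fraction=0.7):
    """Split time-ordered rows into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Placeholder indices standing in for time-ordered weekly observations.
weeks = list(range(10))
train, test = chronological_split(weeks)
print(train)  # earliest 70% of the weeks
print(test)   # most recent 30% of the weeks
```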


3.2. Determining the penalized regression parameter

The parameter used in penalized regression is the lambda value, which controls the amount of regularization
applied to the regression model. The larger the lambda value, the more the coefficients are shrunk toward zero.
When lambda is equal to 0, no regularization is applied and the model reduces to ordinary linear regression. For
each model and data proportion, the lambda value with the smallest cross-validation error is selected [19].
Table 1 shows the selected lambda values along with the lowest mean squared error (MSE) for the Ridge, Lasso,
and Elastic Net models. The lambda values selected for Lasso and Elastic Net are considerably larger than the
one selected for Ridge. With a larger lambda, some variables may receive a coefficient equal to zero, which
means that several independent variables are not chosen as predictors in the Lasso and Elastic Net models. This
is expected, since Lasso regression has a built-in variable selection feature [20] and Elastic Net is a
combination of the Ridge and Lasso models [21].
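The link between a larger lambda and more zero coefficients can be illustrated with the soft-thresholding operator, which gives the Lasso estimate in the simplified case of orthonormal predictors. This is a sketch under that assumption, and the `ols_coefficients` values are hypothetical.

```python
def soft_threshold(z, lam):
    """Lasso estimate for one coefficient when predictors are orthonormal."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def count_zeros(coefficients, lam):
    """Number of coefficients set exactly to zero at a given lambda."""
    return sum(1 for c in coefficients if soft_threshold(c, lam) == 0.0)


# Hypothetical unpenalized coefficients (illustration only).
ols_coefficients = [2.5, -0.4, 0.9, -1.7, 0.1]
for lam in (0.0, 0.5, 1.0, 2.0):
    shrunk = [soft_threshold(c, lam) for c in ols_coefficients]
    print(lam, shrunk, count_zeros(ols_coefficients, lam))
```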

The best lambda values for the SCAD and MCP models in Table 2 are calculated slightly differently from those of
the Ridge, Lasso, and Elastic Net models. The best lambda value is selected based on the lowest cross-validation
error (CVE). The best lambda for SCAD is 0.908 with a CVE of 73.17, and the best lambda for MCP is 0.520 with a
CVE of 69.34.
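Selecting lambda by the lowest cross-validation error can be sketched as below. To stay self-contained, the example uses a one-predictor Ridge regression without intercept, for which the coefficient has the closed form sum(x*y) / (sum(x*x) + lambda). The fold layout, the lambda grid, and the data are assumptions for illustration and do not reproduce the study's R procedure.

```python
def fit_ridge_1d(x, y, lam):
    """Closed-form Ridge coefficient for one predictor and no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def cv_error(x, y, lam, k=5):
    """Mean squared prediction error over k contiguous folds."""
    n = len(x)
    if n < k:
        raise ValueError("need at least k observations")
    fold_errors = []
    for fold in range(k):
        start, stop = fold * n // k, (fold + 1) * n // k
        # Fit on everything outside the held-out fold.
        beta = fit_ridge_1d(x[:start] + x[stop:], y[:start] + y[stop:], lam)
        errors = [(y[i] - beta * x[i]) ** 2 for i in range(start, stop)]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


def select_lambda(x, y, grid, k=5):
    """Return the lambda in the grid with the lowest cross-validation error."""
    return min(grid, key=lambda lam: cv_error(x, y, lam, k))


# Hypothetical data (illustration only).
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
best = select_lambda(x_demo, y_demo, grid=[0.0, 0.1, 1.0, 10.0, 100.0])
print("selected lambda:", best)
```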


Table 1. Ridge, Lasso, and Elastic Net’s lambda values
Data proportion   Method        Lambda value   MSE
70:30             Ridge         1.696          123.49
70:30             Lasso         5.060          121.16
70:30             Elastic Net   6.975          117.42


Table 2. SCAD and MCP's lambda values
Data proportion   Method   Lambda value   CVE
70:30             SCAD     0.908          73.17
70:30             MCP      0.520          69.34


3.3. Building penalized regression model 

Each penalized regression model is tested with a 70:30 proportion of training and testing data. The performance
of each model is measured using the root mean squared error (RMSE) and the symmetric mean absolute percentage
error (SMAPE). The models were tested on both forms of data, namely normal and logarithmically transformed data.
Since the smallest error on the number of cases is lower for normal data (RMSE: 6.38) than for logarithmically
transformed data (RMSE: 8.95), normal data are chosen for building the penalized regression models. The
performance of each penalized regression on normal data can be seen in Table 3.
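The two error measures can be written as short functions. SMAPE has several variants in the literature and the source does not state which one was used, so the version below (absolute error divided by the mean of the absolute actual and predicted values) is an assumption. The sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (assumed variant)."""
    terms = []
    for a, p in zip(actual, predicted):
        denominator = (abs(a) + abs(p)) / 2.0
        # Treat a week with zero actual and zero predicted cases as no error.
        terms.append(0.0 if denominator == 0 else abs(p - a) / denominator)
    return 100.0 * sum(terms) / len(terms)


# Hypothetical weekly case counts (illustration only).
actual_cases = [12, 8, 15, 20]
predicted_cases = [10, 9, 18, 16]
print(f"RMSE: {rmse(actual_cases, predicted_cases):.2f}")
print(f"SMAPE: {smape(actual_cases, predicted_cases):.1f}%")
```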


Table 3. Penalized regression's performances
Method        RMSE   SMAPE
Ridge         6.67   42%
Lasso         7.04   39%
Elastic Net   6.51   41%
SCAD          6.45   37%
MCP           6.70   39%
Based on the test results on normal data, the SCAD model has the best performance among the penalized regression
models, followed by the Elastic Net, Ridge, MCP, and finally Lasso models. The models are then combined into
scenarios following this order, from the smallest RMSE upward. In terms of prediction patterns, Ridge in
Figure 2 captures the pattern quite well: its predictions tend to follow the increases and decreases in the
actual data.

The SCAD model in Figure 3 also captures the data pattern well. Compared to the Ridge and MCP models, SCAD
follows the pattern more closely in the early period, from October 28, 2017 to December 28, 2017. Its predictions
tend to decrease there, so the errors are smaller in this section. However, SCAD does not follow the increase in
the data from October 31, 2018 to November 30, 2018 as well as the Ridge and MCP models do.


Figure 2. Comparison between Ridge and actual (line plot over time; series: Actual, Ridge)

Figure 3. Comparison between SCAD and actual


The MCP model in Figure 4 also follows the data pattern quite well, as its predictions track the actual values.
Lasso has a variable selection feature, through which the model selects the independent variables that are
relevant to the dependent variable. Even so, the Lasso model is less able to capture the actual data pattern and
tends to be less sensitive. Figure 5 shows that the Lasso model does not capture the increase in cases properly,
which may explain why its RMSE is the largest among the models. The Elastic Net model in Figure 6 also has a
variable selection feature like Lasso's. However, Elastic Net still captures the patterns and spikes in the
testing data well, because it also incorporates the Ridge penalty: it combines Ridge's ability to handle
multicollinearity [22] with the selection of variables according to the existing data patterns.


3.4. Building ensemble model 


The models are combined according to predefined scenarios, in order of their RMSE values. The normal data
ensemble scenarios in Table 4 show that the BEST II scenario, which combines the SCAD and Elastic Net models,
has the lowest RMSE, meaning that it has the lowest error among the scenarios. Ordered from low to high RMSE,
BEST II is followed by the BEST III, ALL, and lastly BEST IV scenarios.
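The way the scenarios are assembled from the base models can be sketched by ranking the models on their RMSE in Table 3 and taking the best two, three, four, and all five. The function name `build_scenarios` is hypothetical; the RMSE values are those reported in Table 3.

```python
# RMSE of each base model on the testing data, as reported in Table 3.
rmse_by_model = {
    "Ridge": 6.67,
    "Lasso": 7.04,
    "Elastic Net": 6.51,
    "SCAD": 6.45,
    "MCP": 6.70,
}


def build_scenarios(rmse_table):
    """Map ensemble size to the models included, best RMSE first."""
    ranked = sorted(rmse_table, key=rmse_table.get)
    return {size: ranked[:size] for size in range(2, len(ranked) + 1)}


scenarios = build_scenarios(rmse_by_model)
for size, members in scenarios.items():
    print(size, " + ".join(members))
```

Each scenario's prediction is then the average of the predictions of its member models.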
Figure 4. Comparison between MCP and actual

Figure 5. Comparison between Lasso and actual

Figure 6. Comparison between Elastic Net and actual


Table 4. Performances of ensemble model scenario
Scenario   Model combination                          RMSE   SMAPE
BEST II    SCAD + Elastic Net                         6.38   39%
BEST III   SCAD + Elastic Net + Ridge                 6.45   39%
BEST IV    SCAD + Elastic Net + Ridge + MCP           6.51   39%
ALL        SCAD + Elastic Net + Ridge + MCP + Lasso   6.50   38%
The best model from the experiments is used to predict the next 8 weeks, from January 2019 to February 2019.
To predict the number of dengue fever cases over this period, each independent variable (temperature, humidity,
rainfall, and wind speed) must be forecast first. The forecasting method used for each independent variable
depends on its data pattern. Based on observations, air temperature, air humidity, and wind speed have cyclical
data patterns, in which the patterns repeat over a long period of time [23]. Therefore, these three variables can
be predicted using the multiplicative decomposition method [24]. The results of temperature forecasting for the
next 8 weeks are shown in Figure 7.
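A compact sketch of classical multiplicative decomposition is given below: the trend is estimated with a centered moving average, seasonal indices are the averaged ratios of the series to the trend, and a forecast multiplies the trend by the matching index. The cycle length `period`, the choice to hold the trend flat at its last estimate, and the demo series are all assumptions for illustration; the source does not report these details.

```python
def centered_moving_average(series, period):
    """Trend estimate; None where the averaging window does not fit."""
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 1:
            trend[t] = sum(window) / period
        else:
            # 2 x m moving average: the two end points get half weight.
            trend[t] = (sum(window) - 0.5 * (window[0] + window[-1])) / period
    return trend


def seasonal_indices(series, period):
    """Average ratio of series to trend per cycle position, scaled to mean 1."""
    trend = centered_moving_average(series, period)
    ratios = [[] for _ in range(period)]
    for t, (value, level) in enumerate(zip(series, trend)):
        if level is not None and level != 0:
            ratios[t % period].append(value / level)
    if any(not r for r in ratios):
        raise ValueError("series is too short for this period")
    raw = [sum(r) / len(r) for r in ratios]
    scale = period / sum(raw)
    return [v * scale for v in raw]


def forecast(series, period, horizon):
    """Forecast = last trend estimate (held flat, an assumption) x seasonal index."""
    trend = centered_moving_average(series, period)
    last_level = [v for v in trend if v is not None][-1]
    indices = seasonal_indices(series, period)
    n = len(series)
    return [last_level * indices[(n + h) % period] for h in range(horizon)]


# Hypothetical cyclical series with a cycle of 4 steps (illustration only).
demo = [8.0, 10.0, 12.0, 10.0] * 3
print(forecast(demo, period=4, horizon=8))
```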
Figure 7. Forecast of climate data


The forecast is produced with the best model one week ahead at a time, and each forecast is used as the lag
feature for the following week. This is repeated 8 times until the full forecast is formed. The combination of
SCAD + Elastic Net on normal data, which is the best-performing model, is used to predict the number of dengue
fever cases in Malang Regency. The forecasting period is the first 8 weeks of the year, from January 2019 to
February 2019. The forecast indicates a decrease in the number of dengue fever cases over these 8 weeks. This
visualization is shown in Figure 8.
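The recursive procedure, in which each forecast becomes the lag feature of the next week, can be sketched as follows. For simplicity the one-step model here is only the Elastic Net equation with the intercept and lag-1 coefficient reported in Table 7, not the full SCAD + Elastic Net ensemble, and the starting value `last_observed=20.0` is hypothetical.

```python
def recursive_forecast(one_step_model, last_observed, horizon=8):
    """Forecast `horizon` weeks by feeding each forecast back as lag-1."""
    forecasts = []
    lag_1 = last_observed
    for _ in range(horizon):
        next_value = one_step_model(lag_1)
        forecasts.append(next_value)
        lag_1 = next_value  # the forecast becomes next week's lag feature
    return forecasts


def elastic_net_step(lag_1):
    """One-week-ahead prediction from the Elastic Net coefficients in Table 7."""
    return 8.0364 + 0.23306 * lag_1


# Hypothetical last observed weekly count (illustration only).
path = recursive_forecast(elastic_net_step, last_observed=20.0)
print([round(v, 2) for v in path])
```

With a lag coefficient below 1, the recursive forecasts move steadily toward a fixed level, which is consistent with the flattening forecast described above.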
Figure 8. Forecast of the number of dengue cases for the next 8 weeks

In contrast, the actual data on dengue fever cases in 2019, listed in Table 5, tend to increase. This difference
may be caused by other factors or variables that drive an increase in dengue fever cases. Dengue transmission in
East Java tends to be influenced by population density, population mobility, urbanization, residential areas,
and public places [25]. In this research, the predictors are based only on climate, so it is possible that
factors other than climate have more influence on the increase or decrease in the number of dengue cases.


Table 5. Predicted vs actual dengue cases of 2019
Week number   Actual cases of 2019   Predicted cases of 2019
1             16                     19
2             23                     16
3             26                     14
4             27                     13
5             28                     13
6             35                     13
7             42                     12
8             27                     12


3.5. Comparison with other methods 

To assess whether the forecasting method using SCAD + Elastic Net is good enough, its results need to be
compared with another method. Here the SCAD + Elastic Net model is compared with multiple linear regression
[26]. The models are compared based on their RMSE values, and their performances are shown in Table 6.


Table 6. Ensemble model vs multiple linear regression
Model                        RMSE   SMAPE
SCAD + Elastic Net           6.38   39%
Multiple linear regression   6.45   41%


In addition, in terms of determining the regression coefficients of the independent variables, the Elastic Net
and SCAD models, as penalized regressions, are able to shrink regression coefficients to 0. In other words, they
can eliminate independent variables that are less significant to the model [27]. In multiple linear regression,
all independent variables (wind velocity, rainfall, humidity, air temperature, and lag-1) and the intercept (the
mean value of the response variable when all predictors equal zero) are included in the model. In contrast, the
Elastic Net model selected only 1 variable, namely lag-1, which is the number of dengue cases shifted back by
one time step from the original data. In the SCAD model, humidity was eliminated. These results can be seen in
Table 7.


Table 7. Regression coefficients of multiple linear regression, SCAD, and Elastic Net
Model                        Wind velocity   Rainfall   Humidity   Temperature   Lag-1     Intercept
Multiple linear regression   -1.9127         0.21351    0.00864    0.913607      0.51028   -14.893
Elastic Net                  0               0          0          0             0.23306   8.0364
SCAD                         -1.9559         0.21615    0          0.9094702     0.51088   -14.047
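The coefficients in Table 7 can be read as three linear equations. The sketch below encodes them, lists which variables each model keeps, and evaluates the equations for one week. The coefficient values are taken from Table 7; the feature names and the input values in `week` are hypothetical, since the source does not state the units of the climate variables.

```python
# Coefficients as reported in Table 7.
COEFFICIENTS = {
    "Multiple linear regression": {
        "wind": -1.9127, "rain": 0.21351, "humidity": 0.00864,
        "temperature": 0.913607, "lag_1": 0.51028, "intercept": -14.893,
    },
    "Elastic Net": {
        "wind": 0.0, "rain": 0.0, "humidity": 0.0,
        "temperature": 0.0, "lag_1": 0.23306, "intercept": 8.0364,
    },
    "SCAD": {
        "wind": -1.9559, "rain": 0.21615, "humidity": 0.0,
        "temperature": 0.9094702, "lag_1": 0.51088, "intercept": -14.047,
    },
}


def selected_variables(model):
    """Variables whose coefficient was not shrunk to zero."""
    coefs = COEFFICIENTS[model]
    return [name for name, value in coefs.items()
            if name != "intercept" and value != 0.0]


def predict(model, features):
    """Linear prediction: intercept plus coefficient times feature value."""
    coefs = COEFFICIENTS[model]
    return coefs["intercept"] + sum(coefs[name] * value
                                    for name, value in features.items())


# Hypothetical inputs for one week (illustration only, units not from the source).
week = {"wind": 2.0, "rain": 10.0, "humidity": 80.0, "temperature": 25.0, "lag_1": 20.0}
for model in COEFFICIENTS:
    print(model, selected_variables(model), round(predict(model, week), 2))
```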


4. CONCLUSION 

Based on the results of this research, the following conclusions can be drawn. The best method for forecasting
dengue fever cases is the ensemble model that combines the SCAD and Elastic Net penalized regressions, with an
RMSE of 6.38. The logarithmic transformation of the number of cases does not provide better performance than
normal data: the smallest RMSE is 8.95 for the ln-transformed data and 6.38 for the normal data. Based on the
variable selection results of one of the ensemble's base models (Elastic Net), only the lag-1 variable has a
regression coefficient that is not equal to 0, which means that Elastic Net uses only the lag-1 variable in
constructing the model. In SCAD, only one variable has a regression coefficient equal to 0. To improve the
forecast performance, the selection of variables needs to be reconsidered. In addition to climate factors such as
temperature, humidity, rainfall, and wind speed, future research can explore other variables such as population
density, population mobility, economic growth, environmental sanitation, urbanization, and community behavior.